[TRTLLM-12950][feat] Add MegaMoECuteDsl NVFP4 MoE backend by xxi-nv · Pull Request #14608 · NVIDIA/TensorRT-LLM

xxi-nv · 2026-05-27T03:08:15Z

Summary

Adds MegaMoECuteDsl, a new NVFP4 MoE backend (SM100 / SM103) that wraps the fused MegaMoE CuteDSL kernel Sm100MegaMoEKernel — dispatch + FC1 + activation + FC2 + combine fused into a single kernel.

The kernel is imported from https://gitlab-master.nvidia.com/bangyus/cutedsl_megamoe/-/merge_requests/8 into tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/ (lazy-loaded, so non-SM100 / non-cu13 environments still import the backend for capability probing).
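
A minimal sketch of the lazy-loading idea described above, assuming a PEP 562 style module-level `__getattr__`. The helper `make_lazy_module` and the stand-in symbol table are hypothetical and use stdlib modules for illustration only; they do not mirror the actual layout of `mega_moe_nvfp4/__init__.py`.

```python
import importlib
import types


def make_lazy_module(name: str, symbol_to_module: dict[str, str]) -> types.ModuleType:
    """Build a module whose listed attributes are imported on first access."""
    module = types.ModuleType(name)

    def __getattr__(attr: str):
        # Only called when normal attribute lookup on the module fails.
        try:
            target = symbol_to_module[attr]
        except KeyError:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}") from None
        value = getattr(importlib.import_module(target), attr)
        setattr(module, attr, value)  # cache so later lookups skip this hook
        return value

    module.__getattr__ = __getattr__
    return module


# Hypothetical usage: "sqrt" stands in for an expensive kernel symbol.
kernels = make_lazy_module("lazy_kernels", {"sqrt": "math"})
print(kernels.sqrt(9.0))  # the import of "math" happens here, not at construction
```

With this shape, importing the package itself stays cheap, and an environment that never touches the heavy symbols never pays for (or fails on) their imports.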

Changes

Backend mega_moe/mega_moe_cute_dsl.py — ConfigurableMoE-compatible, FUSED_COMM scheduler, EP-only; one unified path for single- and multi-rank (only the buffer source branches on ep_size).
Custom op + runner custom_ops/cute_dsl_megamoe_custom_op.py — standard CuteDSL TunableRunner / torch-op pattern; multi-rank symmetric memory via torch.distributed._symmetric_memory.
Quant method NVFP4MegaMoECuteDslMethod — NVFP4 weight lifecycle (gate/up interleave + to_blocked swizzle + per-expert fc31_alpha / fc2_alpha / fc1_norm_const), EPLB-aware.
Wiring — factory (create_moe.py), FusedCommMoEScheduler zero-token delegation, MoE developer guide, unit tests.

Test plan

pytest tests/unittest/_torch/modules/moe/ -k MEGAMOE_CUTEDSL — single- / multi-rank backend, module, and EPLB cases on a GB200 with the cu13 cutlass-dsl runtime.

Preview / WIP — staged for follow-up perf + accuracy validation. The factory falls back to CutlassFusedMoE whenever the backend cannot serve the model (can_implement).
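
The fallback behaviour can be pictured roughly as follows. This is a sketch under stated assumptions: `ModelSpec`, `FusedPreviewBackend`, `CutlassFallback`, and `select_backend` are hypothetical stand-ins, and the checks inside `can_implement` only echo the SM100 / SM103 and NVFP4 constraints from the summary, not the real gating logic in `create_moe.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical description of what the model needs from a backend."""
    quant_mode: str
    sm_version: int


class CutlassFallback:
    """Stand-in for the default backend the factory falls back to."""
    name = "cutlass"


class FusedPreviewBackend:
    """Stand-in for the preview backend with a capability probe."""
    name = "megamoe_cutedsl"

    @staticmethod
    def can_implement(spec: ModelSpec) -> tuple[bool, str]:
        # Return (ok, reason) so the caller can log why it fell back.
        if spec.quant_mode != "nvfp4":
            return False, f"unsupported quant mode {spec.quant_mode!r}"
        if spec.sm_version not in (100, 103):
            return False, f"unsupported SM version {spec.sm_version}"
        return True, ""


def select_backend(spec: ModelSpec) -> type:
    """Pick the preview backend only when it reports it can serve the model."""
    ok, _reason = FusedPreviewBackend.can_implement(spec)
    return FusedPreviewBackend if ok else CutlassFallback


print(select_backend(ModelSpec("nvfp4", 100)).name)
print(select_backend(ModelSpec("nvfp4", 90)).name)
```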

coderabbitai · 2026-05-27T03:21:18Z

📝 Walkthrough

This PR introduces a complete MegaMoE CuteDSL NVFP4 fused-communication backend that enables multi-rank expert routing with symmetric memory, token communication overlapping, and persistent scheduling. The implementation spans ~18K lines across kernel DSL code, custom op registration, backend class, quantization methods, and test infrastructure, featuring a flattened kernel package architecture with lazy imports to defer expensive CuteDSL symbol resolution.

Changes

MegaMoE CuteDSL NVFP4 Backend

| Layer / File(s) | Summary |
| --- | --- |
| Tooling setup `.pre-commit-config.yaml`, `legacy-files.txt`, `pyproject.toml`, `ruff-legacy.toml`, `tensorrt_llm/_torch/autotuner.py` | Configuration files updated to include mega_moe_nvfp4 kernel package files in linting/formatting hooks. FP8 tensor generation added to autotuner with uint8-view pattern. |
| Kernel package init & core infrastructure `tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/__init__.py`, `config.py`, `megamoe_constants.py`, `blocked_scale.py`, `contract.py` | Package docstring and lazy-loading interface; DSV4Config configuration structures; blocked-scale tensor swizzling for NVFP4 atoms; finite coordinate Space and Contract mapping framework for RMEM tensor handoff. |
| Device-side persistent scheduler & work-tile `moe_persistent_scheduler.py` | MoEWorkTileInfo payload with MLIR serialization; MoESchedExtension interface for work-tile enrichment; static and dynamic (CLC-based) persistent schedulers with short-side-first raster and expert boundary caching. |
| Device-side utilities & TMA descriptors `moe_utils.py`, `grid_sync.py`, `iket_compat.py`, `ptx_helpers.py`, `sf_swizzle.py` | PTX-level grid-sync barrier; TMA 1D load/store helpers; DSMEM cluster communication; pointer/address conversions; scale-factor atom layout swizzling; IKET dialect compatibility wrapper; online TMA descriptor workspace and grouping constructors. |
| Fused FC1+FC2 scheduler `fc1_fc2_fuse_sched.py` | Persistent tile scheduler for fused fc1+fc2 with group→phase→expert state machine, greedy group formation via group_hint, and atomic-counter load-balancing option. |
| Dispatch kernel & scheduler extension `dispatch_kernel.py`, `custom_ext.py` | 3-stage dispatch flow (prep/barrier/pull) with TMA token pulls and cross-rank NVLink signaling. SwapAB-aware scheduler extension with phase-decoded peek-ready bits for fc1/fc2 readiness. |
| MegaMoE kernel launcher `megamoe_kernel.py` | Workspace region layout, token communication hooks, TokenCommArgs serialization, and integration of dispatch/fc1/fc2 phases with NVLink barrier ordering. |
| Torch custom op & runner `tensorrt_llm/_torch/custom_ops/cute_dsl_megamoe_custom_op.py`, `__init__.py` | Registers `cute_dsl_megamoe_nvfp4_blackwell` op with tactic validation, symmetric-memory provider, local workspace caching, and Sm100MegaMoENvfp4Runner for kernel compilation/execution via autotuner. |
| MegaMoECuteDsl backend `tensorrt_llm/_torch/modules/fused_moe/mega_moe/mega_moe_cute_dsl.py` | Backend class with `can_implement()` gating, EP process-group resolution, weight creation/loading, `quantize_input()` for NVFP4 packing and FP8 SF padding, and `run_moe()` with kernel invocation and form-A reduction. |
| NVFP4 weight lifecycle `tensorrt_llm/_torch/modules/fused_moe/quantization.py` | NVFP4MegaMoECuteDslMethod with MegaMoE-format weight registration, expert weight staging, blocked-scale transformation via `to_blocked`, EPLB shared staging, and v1 alpha gating. |
| Module & factory integration `mega_moe/__init__.py`, `create_moe.py`, `configurable_moe.py`, `moe_scheduler.py` | Module exports updated for CuteDSL classes/methods. Factory adds pretrained capability helper and backend selection. ConfigurableMoE defers validation. Scheduler delegates quantize_input to backend for zero-token chunks. DeepGemm adds empty-input handling. |
| Test infrastructure `tests/unittest/.../moe_test_utils.py`, `test_moe_backend.py`, `test_moe_module.py`, `tests/microbenchmarks/.../backend.py`, `utils.py` | Test utilities split MEGAMOE into MEGAMOE_DEEPGEMM and MEGAMOE_CUTEDSL with backend-specific skip predicates. Backend/module tests updated with CuteDSL-specific multi-GPU and EPLB generators. Benchmark tests use explicit backend enum. |
| Documentation `MOE_DEVELOPER_GUIDE.md`, `mega_moe/MEGAMOE_CUTEDSL_DESIGN.md` | Comprehensive design spec for CuteDSL backend integration, hard gates, call chain, runner/autotuner design, quantization spec, and test plan. Guide updated with backend details, capability matrix, and FUSED_COMM anti-patterns. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The PR is large (18K+ lines) and heterogeneous, spanning device-side CUDA kernel DSL code, host-side PyTorch backend infrastructure, quantization pipelines, and test coverage. Key areas of complexity: persistent scheduler state machine with multiple advancement modes; dispatch kernel 3-stage barrier with cross-rank NVLink signaling; atomic-counter load-balancing option; MLIR serialization/deserialization for DSL interoperability; symmetric-memory provider caching; tactic validation and compile caching; weight transformation pipeline with blocked-scale swizzling; v1 alpha gating. Many changes are implementations of analogous patterns (e.g., scheduler variants, TMA descriptor constructors, skip predicates), which reduces per-file review friction but requires understanding the underlying design principles. The design doc and developer guide are essential context for review.

Suggested reviewers

nv-guomingz
syuoni
leslie-fang25
yuxianq
mingyangHao
tomeras91


coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (11)

tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md (1)
152-152: ⚡ Quick win

Clarify symmetric-memory provider terminology.

The description states "requires CUDA 13 Cutlass DSL runtime (PR #14354) and NVSHMEM provider". Based on the design doc (phase 0, lines 938-939), the implementation uses torch.distributed._symmetric_memory (cuMem-based), not an external NVSHMEM package. Consider revising to "symmetric memory provider" or "torch.distributed._symmetric_memory" to avoid implying a dependency on nvshmem4py-cu13, which was discussed in the design doc (lines 533-534) but not chosen for v1.
📝 Proposed clarification
-| `mega_moe/mega_moe_cute_dsl.py` | `MegaMoECuteDsl` | SM100/SM103 | NVFP4 via ported CuteDSL `Sm100MegaMoEKernel` fused dispatch+FC1+act+FC2+combine kernel; requires CUDA 13 Cutlass DSL runtime (PR `#14354`) and NVSHMEM provider (hard gate — see `mega_moe/MEGAMOE_CUTEDSL_DESIGN.md`); v1 alpha=1 product gate enforced in `post_load_weights` | `FUSED_COMM` |
+| `mega_moe/mega_moe_cute_dsl.py` | `MegaMoECuteDsl` | SM100/SM103 | NVFP4 via ported CuteDSL `Sm100MegaMoEKernel` fused dispatch+FC1+act+FC2+combine kernel; requires CUDA 13 Cutlass DSL runtime (PR `#14354`) and symmetric memory provider (`torch.distributed._symmetric_memory`); v1 alpha=1 product gate enforced in `post_load_weights` | `FUSED_COMM` |
Based on learnings from MEGAMOE_CUTEDSL_DESIGN.md phase 0: the implementation uses PyTorch's torch.distributed._symmetric_memory (cuMem-based NVSHMEM-equivalent), not an external NVSHMEM package.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md` at line 152,
Update the developer guide entry describing mega_moe/mega_moe_cute_dsl.py /
MegaMoECuteDsl to avoid implying an external NVSHMEM package: change "NVSHMEM
provider" to "symmetric memory provider (torch.distributed._symmetric_memory)"
or similar wording that explicitly names PyTorch's cuMem-based provider; ensure
the note about requiring CUDA 13 Cutlass DSL runtime and the product gate
references remain unchanged so readers know runtime and gating requirements.
tensorrt_llm/_torch/custom_ops/cute_dsl_megamoe_custom_op.py (1)
706-706: 💤 Low value

Consider ClassVar annotation for shared mutable class attribute.

The kernel_cache is intentionally shared across all instances for compile caching. Adding ClassVar makes this intent explicit and silences the RUF012 warning:
+from typing import ClassVar
+
 class Sm100MegaMoENvfp4Runner(TunableRunner):
     ...
     # Module-scope compile cache shared by every runner instance.
-    kernel_cache: dict = {}
+    kernel_cache: ClassVar[dict] = {}
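
To illustrate why the annotation matches the intent, here is a small self-contained sketch. The `Runner` class and its cache key are hypothetical and unrelated to the real `Sm100MegaMoENvfp4Runner`; the point is only that a `ClassVar` dict mutated through `self` is one object shared by every instance.

```python
from typing import ClassVar


class Runner:
    # One compile cache per class, shared by all instances (not per object).
    kernel_cache: ClassVar[dict[tuple, str]] = {}

    def get_kernel(self, key: tuple) -> str:
        # Mutating the dict in place never rebinds the attribute on the instance,
        # so every runner sees entries "compiled" by any other runner.
        if key not in self.kernel_cache:
            self.kernel_cache[key] = f"compiled{key}"
        return self.kernel_cache[key]


first, second = Runner(), Runner()
first.get_kernel((128, 64))
print((128, 64) in second.kernel_cache)
```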
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/custom_ops/cute_dsl_megamoe_custom_op.py` at line 706,
The class-level mutable cache kernel_cache is intended to be shared across
instances but lacks an explicit ClassVar annotation; update the declaration of
kernel_cache in the class to be typed as typing.ClassVar[dict] (and add from
typing import ClassVar if not already imported) so the intent is explicit and
the RUF012 warning is silenced while preserving the shared compile cache
behavior.
tests/unittest/_torch/modules/moe/test_moe_backend.py (2)
568-577: ⚡ Quick win

Add a companion negative test for the v1 alpha gate.

Setting TRTLLM_MEGAMOE_CUTEDSL_BYPASS_V1_ALPHA_GATE=1 for this path is useful for end-to-end execution, but it also removes direct coverage that the default gate rejects non-1 alpha checkpoints. Please add a small paired test with the env var unset that asserts rejection.
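
A rough sketch of what such a paired test could look like. Everything except the environment variable name is an assumption: `check_alpha_gate` is a hypothetical stand-in for the gate in `post_load_weights`, the exception type is a guess, and `unittest.mock.patch.dict` is used here in place of pytest's `monkeypatch` so the snippet runs with the standard library alone.

```python
import os
from unittest import mock

BYPASS_ENV = "TRTLLM_MEGAMOE_CUTEDSL_BYPASS_V1_ALPHA_GATE"


def check_alpha_gate(alphas: list[float]) -> None:
    """Hypothetical gate: reject non-1 alphas unless the bypass env var is '1'."""
    if os.environ.get(BYPASS_ENV) == "1":
        return
    if any(alpha != 1.0 for alpha in alphas):
        raise ValueError("v1 requires alpha == 1 for every expert")


def gate_rejects(alphas: list[float]) -> bool:
    """Return True when the gate raises for the given alphas."""
    try:
        check_alpha_gate(alphas)
    except ValueError:
        return True
    return False


def test_gate_rejects_without_bypass() -> None:
    # Run with the bypass variable removed, whatever the outer environment holds.
    env = {k: v for k, v in os.environ.items() if k != BYPASS_ENV}
    with mock.patch.dict(os.environ, env, clear=True):
        assert gate_rejects([1.0, 0.5])
        assert not gate_rejects([1.0, 1.0])


def test_gate_passes_with_bypass() -> None:
    with mock.patch.dict(os.environ, {BYPASS_ENV: "1"}):
        assert not gate_rejects([1.0, 0.5])
```

Keeping the positive and negative cases side by side makes it harder for the bypass to silently become the only exercised path.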
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/moe/test_moe_backend.py` around lines 568 -
577, Add a companion negative test that mirrors the MEGAMOE_CUTEDSL branch but
ensures the v1 alpha gate is enforced: when backend_type ==
MoeBackendType.MEGAMOE_CUTEDSL, explicitly ensure
TRTLLM_MEGAMOE_CUTEDSL_BYPASS_V1_ALPHA_GATE is unset (use monkeypatch.delenv
with raising=False), then exercise the same load -> post-load ->
run_moe path used in the positive case and assert that the load/post-load step
fails (raises/rejects) due to non-1 alpha values; place this next to the
existing positive branch in
tests/unittest/_torch/modules/moe/test_moe_backend.py so it uses the same setup
and failure assertion to prove the gate blocks non-1 alpha checkpoints.
307-315: QA integration list updates are unnecessary for this change-set.

This file only expands unit-test backend coverage; no tests/integration/defs/ entries were added or materially changed, so tests/integration/test_lists/qa/* updates are not required in this PR.

As per coding guidelines: "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/moe/test_moe_backend.py` around lines 307 -
315, Add a brief explicit note that QA integration list updates are unnecessary
because this change only expands unit-test backend coverage and does not modify
tests/integration/defs; place the comment immediately above
BACKEND_TYPES_TO_TEST (referencing BACKEND_TYPES_TO_TEST, MoeBackendType.*
entries) so reviewers and release/QA scripts know no
tests/integration/test_lists/qa/* edits are required.
tests/microbenchmarks/bench_moe/utils.py (1)
66-79: 💤 Low value

_ensure_dist_for_megamoe only handles MEGAMOE_DEEPGEMM, not MEGAMOE_CUTEDSL.

This function checks only for MEGAMOE_DEEPGEMM but test_moe_module.py:_ensure_dist_for_megamoe (lines 139-143) checks for both MegaMoE variants. If bench_moe is ever used with MEGAMOE_CUTEDSL, the distributed process group won't be initialized.

Since MEGAMOE_CUTEDSL isn't in the benchmark's MoeBackendType enum yet (preview phase), this is acceptable for now but should be updated when CuteDSL is added to the benchmark registry.
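
One possible shape for the eventual fix, sketched with a local stand-in enum. The `MoeBackendType` below is an illustrative subset (the benchmark's real enum does not yet contain `MEGAMOE_CUTEDSL`), and `needs_dist_init` is a hypothetical helper rather than the existing `_ensure_dist_for_megamoe`.

```python
from enum import Enum


class MoeBackendType(Enum):
    """Illustrative subset; not the benchmark's actual enum."""
    CUTLASS = "CUTLASS"
    MEGAMOE_DEEPGEMM = "MEGAMOE_DEEPGEMM"
    MEGAMOE_CUTEDSL = "MEGAMOE_CUTEDSL"


# Every MegaMoE variant that needs a process group before the backend is built.
_MEGAMOE_BACKENDS = frozenset(
    {MoeBackendType.MEGAMOE_DEEPGEMM.value, MoeBackendType.MEGAMOE_CUTEDSL.value}
)


def needs_dist_init(backend_name: str) -> bool:
    """Return True when the named backend requires torch.distributed to be set up."""
    return backend_name in _MEGAMOE_BACKENDS


print(needs_dist_init("MEGAMOE_CUTEDSL"))
print(needs_dist_init("CUTLASS"))
```

Centralising the set in one place means a future MegaMoE variant only needs one new entry instead of another equality check.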
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/microbenchmarks/bench_moe/utils.py` around lines 66 - 79, The
early-return in _ensure_dist_for_megamoe only recognizes
MoeBackendType.MEGAMOE_DEEPGEMM, so the NCCL ProcessGroup will not be
initialized for the other MegaMoE variant; update _ensure_dist_for_megamoe to
also accept the MEGAMOE_CUTEDSL backend (or generally any future MegaMoE enum
value) by checking for both MoeBackendType.MEGAMOE_DEEPGEMM.value and
MoeBackendType.MEGAMOE_CUTEDSL.value (or by matching a common prefix/enum
category), keeping the rest of the initialization (CUDA check, env vars,
dist.init_process_group) unchanged; reference function name
_ensure_dist_for_megamoe and the test file
test_moe_module.py:_ensure_dist_for_megamoe to ensure parity with tests and add
a TODO comment to revisit when CuteDSL is formally added to the benchmark
registry.
tests/microbenchmarks/bench_moe/backend.py (1)
121-124: 💤 Low value

Consider adding MEGAMOE_CUTEDSL to the benchmark registry.

The MoeBackendType enum and get_backend_class dispatch only include MEGAMOE_DEEPGEMM. The CuteDSL variant is missing, which means bench_moe cannot benchmark the new backend.

This is acceptable for the preview phase, but should be addressed before the feature graduates to production.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/microbenchmarks/bench_moe/backend.py` around lines 121 - 124, Add the
missing CuteDSL backend to the registry by extending the MoeBackendType enum
with a MEGAMOE_CUTEDSL member and updating the get_backend_class dispatch to
handle it: import the CuteDSL backend implementation (e.g., the MegaMoE CuteDSL
class) alongside MegaMoEDeepGemm and return that class when backend_type ==
MoeBackendType.MEGAMOE_CUTEDSL; ensure you reference the new enum value
(MoeBackendType.MEGAMOE_CUTEDSL) and the CuteDSL class name (e.g.,
MegaMoECuteDsl) in the dispatch so bench_moe can instantiate the CuteDSL
backend.
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/moe_persistent_scheduler.py (2)
1926-1927: ⚡ Quick win

Prefix unused variable with underscore.

The n_idx variable from the CLC response unpacking is never used. Prefix it with underscore to indicate it's intentionally ignored.
Suggested fix
-        m_idx, n_idx, l_idx, is_valid = cute.arch.clc_response(
+        m_idx, _n_idx, l_idx, is_valid = cute.arch.clc_response(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/moe_persistent_scheduler.py`
around lines 1926 - 1927, The unpacking of the CLC response in the call to
cute.arch.clc_response currently assigns an unused variable n_idx; change that
identifier to a prefixed name (e.g., _n_idx) so it signals intentional ignore.
Locate the line where m_idx, n_idx, l_idx, is_valid =
cute.arch.clc_response(...) (inside moe_persistent_scheduler.py) and replace
n_idx with _n_idx in the tuple assignment; leave m_idx, l_idx and is_valid
unchanged.
7-7: 💤 Low value

Consider using built-in types instead of typing module.

Per coding guidelines, prefer built-in list, tuple over typing.List, typing.Tuple for Python 3.10+. Also prefer X | None over Optional[X].

This is a low-priority cleanup given the file size.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/moe_persistent_scheduler.py`
at line 7, Replace typing.List, typing.Tuple, and typing.Optional usages with
built-in generic types and PEP 604 unions: change annotations like List[X] ->
list[X], Tuple[A, B] -> tuple[A, B], and Optional[T] -> T | None; remove List,
Optional, Tuple from the import line in moe_persistent_scheduler.py and
keep/import only Literal if still needed; update all function/method signatures
and variable annotations that reference List, Tuple, or Optional accordingly
(search for usages of List, Tuple, Optional in the file and replace them with
list, tuple, and | None forms).
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/blocked_scale.py (1)
33-33: 💤 Low value

Prefer built-in list over typing.List.

Per coding guidelines, prefer built-in types like list over legacy typing.List for Python 3.10+.
Suggested fix
-from typing import List
Then update function signatures:
-def cat_byte_reinterpretable_tensors(tensors: List[torch.Tensor], dim: int = 0) -> torch.Tensor:
+def cat_byte_reinterpretable_tensors(tensors: list[torch.Tensor], dim: int = 0) -> torch.Tensor:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/blocked_scale.py` at line
33, Replace the legacy typing.List import and usages with the built-in list
type: remove "from typing import List" and update all annotations in this module
(e.g., function signatures, return types, and variable annotations that
reference List) to use the native list[...] form (for example change "List[T]"
to "list[T]"); ensure functions/methods in this file that reference List (search
for symbols like any function or class definitions using List) are updated
accordingly and the typing import is deleted.
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/megamoe_constants.py (1)
5-13: 💤 Low value

Consider UPPER_SNAKE_CASE for module-level constants.

Per coding guidelines, Python constants should use UPPER_SNAKE_CASE (e.g., NVFP4_BLOCK_SIZE, SF_PADDING_BLOCK). The current PascalCase naming is internally consistent but deviates from the project convention.

Given this is a preview PR, this can be deferred if the naming aligns with an upstream source being ported.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/megamoe_constants.py`
around lines 5 - 13, Module-level constants use PascalCase; rename them to
UPPER_SNAKE_CASE to match project conventions: change Nvfp4BlockSize ->
NVFP4_BLOCK_SIZE, SfPaddingBlock -> SF_PADDING_BLOCK, TmaLeadingDimByteAlign ->
TMA_LEADING_DIM_BYTE_ALIGN, Nvfp4E2M1Max -> NVFP4_E2M1_MAX, Fp8E4M3FNMax ->
FP8_E4M3_FN_MAX, SupportedMmaTileM -> SUPPORTED_MMA_TILE_M, SupportedMmaTileN ->
SUPPORTED_MMA_TILE_N, and update all internal references/usages of these symbols
accordingly; if external consumers may rely on the old names, add short-lived
aliases mapping the old names to the new constants in the same module to
preserve backward compatibility (e.g., Nvfp4BlockSize = NVFP4_BLOCK_SIZE) and
add a TODO to remove aliases later.
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/megamoe_kernel.py (1)
100-100: 💤 Low value

Modernize type hints to use built-in types.

The coding guidelines prefer built-in types (list, dict, tuple) over legacy typing equivalents, and X | None over Optional[X]. This file imports and uses List, Dict, Tuple, Optional throughout.
♻️ Suggested import and usage update
-from typing import Any, Dict, List, Optional, Tuple, Type
+from typing import Any
Then update usages throughout, e.g.:

List[ir.Value] → list[ir.Value]

Dict[str, int] → dict[str, int]

Tuple[int, ...] → tuple[int, ...]

Optional[int] → int | None

Type[cutlass.Numeric] → type[cutlass.Numeric]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/megamoe_kernel.py` at
line 100, Replace legacy typing generics with built-in generics and PEP 604
unions: remove List, Dict, Tuple, Optional, Type from the typing import and use
list, dict, tuple, X | None, and type[...] in their places; e.g., change imports
on the line with "from typing import Any, Dict, List, Optional, Tuple, Type" to
keep only Any (if needed) and update all occurrences like
List[ir.Value]→list[ir.Value], Dict[str,int]→dict[str,int],
Tuple[int,...]→tuple[int,...], Optional[int]→int | None, and
Type[cutlass.Numeric]→type[cutlass.Numeric] throughout megamoe_kernel.py (search
for those symbols: List, Dict, Tuple, Optional, Type and replace accordingly).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/custom_ext.py`:
- Around line 346-352: The code builds sf_layout = tile_atom_to_shape_SF(...)
but then throws away the swizzled layout by calling
cute.make_layout(sf_layout.shape, stride=stride) when constructing the tensor;
update the cute.make_tensor call for the "sfa" branch (and the analogous
"sfb"/"sfc" branches) to pass the full sf_layout (the layout object returned by
tile_atom_to_shape_SF) into cute.make_tensor instead of reconstructing a plain
strided layout, so the atom-swizzled layout is preserved for the descriptor
path.

In `@tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/fc1_fc2_fuse_sched.py`:
- Around line 890-909: The mbarrier expect_tx must be set before issuing the
async cluster store to avoid a race where the store completes before the peer is
armed; move the mbarrier_arrive_expect_tx_on_peer(...) call to precede
store_i32_to_peer_cluster_smem_async(...) in the path handling lane_idx <
Int32(cluster_size) so the peer's expect_tx is armed (matching the Int32(4) tx
size) before the st.async.shared::cluster write; reference
store_i32_to_peer_cluster_smem_async, mbarrier_arrive_expect_tx_on_peer, and
consumer_wait when making the reorder.

In `@tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/moe_utils.py`:
- Around line 954-955: The constructor/signature should not allow expert_cnt or
dependency to be None because construct_and_write() immediately uses dependency
and self.expert_cnt; update the signatures (remove Optional[...] = None and
default None) so expert_cnt and dependency are required parameters, and/or add
an early validation in construct_and_write() that raises a ValueError if
dependency is None or self.expert_cnt is None (referencing construct_and_write,
self.expert_cnt, and dependency), and apply the same change to the other
occurrence around lines 1014-1021 so both call sites enforce non-None values.

In `@tensorrt_llm/_torch/modules/fused_moe/mega_moe/mega_moe_cute_dsl.py`:
- Around line 436-454: Replace the assertion-based argument checks in
MegaMoECuteDsl's initializer with explicit validation that raises ValueError:
change the two assert checks that reference self.tp_size and self.cluster_size
to if ...: raise ValueError(...) and similarly change the assert that checks
self.ep_size == self.parallel_size when self.use_dp and self.parallel_size > 1
to an if that raises ValueError; keep the existing error messages (adjusted to
the ValueError) and keep the surrounding logic for num_slots/ep_size as-is.
- Around line 708-711: The input validation in MegaMoECuteDsl.load_weights uses
an assert which is skipped with Python -O; change this to an explicit check and
raise a ValueError when the length is not 1 (e.g., if len(weights) != 1: raise
ValueError("MegaMoECuteDsl.load_weights expects a single-element list, got
{len(weights)} entries.")) so callers always get a clear exception; update the
same error message currently used by the assert.

In `@tensorrt_llm/_torch/modules/fused_moe/quantization.py`:
- Around line 3535-3549: The code crashes when per_slot is empty because
stack_byte_reinterpretable_tensors is called with an empty list; in
_build_mega_sf (and similarly in the other block), detect the zero-slot case
(e.g., if num_slots == 0 or per_slot == []) before stacking and return an
appropriately shaped empty/padded tensor (torch.zeros with shape (0, flat_size)
or (num_slots, flat_size), matching dtype=stacked.dtype/torch.uint8 and
device=device) so downstream callers expecting a (num_slots, flat_size) tensor
do not fail; implement this early-return right after constructing/validating
per_slot and before calling stack_byte_reinterpretable_tensors.
- Around line 3205-3212: The code assumes a 2x gated expansion
(expand_intermediate == 2 * intermediate) but never enforces it, causing
mis-sized MegaMoE buffers; add an explicit validation early in the weight
construction path (e.g., in create_weights) that checks
expand_intermediate_size_per_partition == 2 * intermediate and raise a clear
exception if not, and mirror the same guard in fc1_sf_flat_size (or call the
validated value) so fc1_sf_flat_size, the [w3|w1] split/interleave logic and
_build_mega_format_buffers all receive a guaranteed 2x expansion; reference
expand_intermediate_size_per_partition, create_weights, fc1_sf_flat_size, and
_build_mega_format_buffers when adding the assertion and error message.

---

Nitpick comments:
In `@tensorrt_llm/_torch/custom_ops/cute_dsl_megamoe_custom_op.py`:
- Line 706: The class-level mutable cache kernel_cache is intended to be shared
across instances but lacks an explicit ClassVar annotation; update the
declaration of kernel_cache in the class to be typed as typing.ClassVar[dict]
(and add from typing import ClassVar if not already imported) so the intent is
explicit and the RUF012 warning is silenced while preserving the shared compile
cache behavior.

In `@tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/blocked_scale.py`:
- Line 33: Replace the legacy typing.List import and usages with the built-in
list type: remove "from typing import List" and update all annotations in this
module (e.g., function signatures, return types, and variable annotations that
reference List) to use the native list[...] form (for example change "List[T]"
to "list[T]"); ensure functions/methods in this file that reference List (search
for symbols like any function or class definitions using List) are updated
accordingly and the typing import is deleted.

In `@tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/megamoe_constants.py`:
- Around line 5-13: Module-level constants use PascalCase; rename them to
UPPER_SNAKE_CASE to match project conventions: change Nvfp4BlockSize ->
NVFP4_BLOCK_SIZE, SfPaddingBlock -> SF_PADDING_BLOCK, TmaLeadingDimByteAlign ->
TMA_LEADING_DIM_BYTE_ALIGN, Nvfp4E2M1Max -> NVFP4_E2M1_MAX, Fp8E4M3FNMax ->
FP8_E4M3_FN_MAX, SupportedMmaTileM -> SUPPORTED_MMA_TILE_M, SupportedMmaTileN ->
SUPPORTED_MMA_TILE_N, and update all internal references/usages of these symbols
accordingly; if external consumers may rely on the old names, add short-lived
aliases mapping the old names to the new constants in the same module to
preserve backward compatibility (e.g., Nvfp4BlockSize = NVFP4_BLOCK_SIZE) and
add a TODO to remove aliases later.

In `@tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/megamoe_kernel.py`:
- Line 100: Replace legacy typing generics with built-in generics and PEP 604
unions: remove List, Dict, Tuple, Optional, Type from the typing import and use
list, dict, tuple, X | None, and type[...] in their places; e.g., change imports
on the line with "from typing import Any, Dict, List, Optional, Tuple, Type" to
keep only Any (if needed) and update all occurrences like
List[ir.Value]→list[ir.Value], Dict[str,int]→dict[str,int],
Tuple[int,...]→tuple[int,...], Optional[int]→int | None, and
Type[cutlass.Numeric]→type[cutlass.Numeric] throughout megamoe_kernel.py (search
for those symbols: List, Dict, Tuple, Optional, Type and replace accordingly).

In
`@tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/moe_persistent_scheduler.py`:
- Around line 1926-1927: The unpacking of the CLC response in the call to
cute.arch.clc_response currently assigns an unused variable n_idx; change that
identifier to a prefixed name (e.g., _n_idx) so it signals intentional ignore.
Locate the line where m_idx, n_idx, l_idx, is_valid =
cute.arch.clc_response(...) (inside moe_persistent_scheduler.py) and replace
n_idx with _n_idx in the tuple assignment; leave m_idx, l_idx and is_valid
unchanged.
- Line 7: Replace typing.List, typing.Tuple, and typing.Optional usages with
built-in generic types and PEP 604 unions: change annotations like List[X] ->
list[X], Tuple[A, B] -> tuple[A, B], and Optional[T] -> T | None; remove List,
Optional, Tuple from the import line in moe_persistent_scheduler.py and
keep/import only Literal if still needed; update all function/method signatures
and variable annotations that reference List, Tuple, or Optional accordingly
(search for usages of List, Tuple, Optional in the file and replace them with
list, tuple, and | None forms).

In `@tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md`:
- Line 152: Update the README entry describing mega_moe/mega_moe_cute_dsl.py /
MegaMoECuteDsl to avoid implying an external NVSHMEM package: change "NVSHMEM
provider" to "symmetric memory provider (torch.distributed._symmetric_memory)"
or similar wording that explicitly names PyTorch's cuMem-based provider; ensure
the note about requiring CUDA 13 Cutlass DSL runtime and the product gate
references remain unchanged so readers know runtime and gating requirements.

In `@tests/microbenchmarks/bench_moe/backend.py`:
- Around line 121-124: Add the missing CuteDSL backend to the registry by
extending the MoeBackendType enum with a MEGAMOE_CUTEDSL member and updating the
get_backend_class dispatch to handle it: import the CuteDSL backend
implementation (e.g., the MegaMoE CuteDSL class) alongside MegaMoEDeepGemm and
return that class when backend_type == MoeBackendType.MEGAMOE_CUTEDSL; ensure
you reference the new enum value (MoeBackendType.MEGAMOE_CUTEDSL) and the
CuteDSL class name (e.g., MegaMoECuteDSL) in the dispatch so bench_moe can
instantiate the CuteDSL backend.

In `@tests/microbenchmarks/bench_moe/utils.py`:
- Around line 66-79: The early return in _ensure_dist_for_megamoe only
recognizes MoeBackendType.MEGAMOE_DEEPGEMM, so the NCCL ProcessGroup is not
initialized for the other MegaMoE variant. Update _ensure_dist_for_megamoe to
also accept the MEGAMOE_CUTEDSL backend (or any future MegaMoE enum value) by
checking for both MoeBackendType.MEGAMOE_DEEPGEMM.value and
MoeBackendType.MEGAMOE_CUTEDSL.value (or by matching a common prefix/enum
category), and keep the rest of the initialization (CUDA check, env vars,
dist.init_process_group) unchanged. Mirror the helper of the same name in
test_moe_module.py to keep parity with the tests, and add a TODO comment to
revisit this when CuteDSL is formally added to the benchmark registry.
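
The widened early-return check could be sketched as follows. This is an illustration only: the enum subset, its values, and the helper names `needs_megamoe_dist` and `ensure_dist_for_megamoe` are assumptions, and the actual process-group setup is deliberately left out.

```python
from enum import Enum


class MoeBackendType(Enum):
    # Hypothetical subset of the benchmark enum; values are assumptions.
    CUTLASS = "CUTLASS"
    MEGAMOE_DEEPGEMM = "MEGAMOE_DEEPGEMM"
    MEGAMOE_CUTEDSL = "MEGAMOE_CUTEDSL"


# Every MegaMoE variant that needs the process group before the backend is built.
_MEGAMOE_BACKENDS = frozenset(
    {MoeBackendType.MEGAMOE_DEEPGEMM.value, MoeBackendType.MEGAMOE_CUTEDSL.value}
)


def needs_megamoe_dist(backend: str) -> bool:
    """Return True when `backend` names a MegaMoE variant."""
    return backend in _MEGAMOE_BACKENDS


def ensure_dist_for_megamoe(backend: str) -> bool:
    """Early-return guard only; returns whether initialization would proceed."""
    if not needs_megamoe_dist(backend):
        return False
    # TODO: revisit when CuteDSL is formally added to the benchmark registry.
    # The real helper would continue with the CUDA check, env vars, and
    # dist.init_process_group here; that part is intentionally omitted.
    return True
```

Keeping the MegaMoE values in one set means a future variant needs a single-line change rather than another comparison in the guard.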

In `@tests/unittest/_torch/modules/moe/test_moe_backend.py`:
- Around line 568-577: Add a companion negative test that mirrors the
MEGAMOE_CUTEDSL branch but ensures the v1 alpha gate is enforced: when
backend_type == MoeBackendType.MEGAMOE_CUTEDSL, explicitly ensure
TRTLLM_MEGAMOE_CUTEDSL_BYPASS_V1_ALPHA_GATE is unset (use monkeypatch.delenv
with raising=False), then exercise the same load -> post-load ->
run_moe path used in the positive case and assert that the load/post-load step
fails (raises/rejects) due to non-1 alpha values; place this next to the
existing positive branch in
tests/unittest/_torch/modules/moe/test_moe_backend.py so it uses the same setup
and failure assertion to prove the gate blocks non-1 alpha checkpoints.
- Around line 307-315: Add a brief explicit note that QA integration list
updates are unnecessary because this change only expands unit-test backend
coverage and does not modify tests/integration/defs; place the comment
immediately above BACKEND_TYPES_TO_TEST (referencing BACKEND_TYPES_TO_TEST,
MoeBackendType.* entries) so reviewers and release/QA scripts know no
tests/integration/test_lists/qa/* edits are required.
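
The negative gate test suggested above could take roughly this shape. The gate function `check_alpha_gate`, its `ValueError`, and the convention that the value "1" enables the bypass are all assumptions made for illustration; only the environment variable name comes from the review. The `monkeypatch` argument is pytest's built-in fixture.

```python
import os

# Name taken from the review comment; the "1 enables bypass" convention is assumed.
BYPASS_ENV = "TRTLLM_MEGAMOE_CUTEDSL_BYPASS_V1_ALPHA_GATE"


def check_alpha_gate(alphas: list[float]) -> None:
    """Hypothetical stand-in for the load-time gate: reject non-1 alpha values."""
    if os.environ.get(BYPASS_ENV) == "1":
        return
    if any(alpha != 1.0 for alpha in alphas):
        raise ValueError("v1 kernel requires alpha == 1 for every expert")


def test_gate_blocks_non_unit_alpha(monkeypatch) -> None:
    # Make sure an inherited bypass setting cannot mask the failure.
    monkeypatch.delenv(BYPASS_ENV, raising=False)
    try:
        check_alpha_gate([1.0, 0.5])
    except ValueError:
        return  # expected: the gate rejected the checkpoint
    raise AssertionError("gate accepted non-1 alpha values")
```

Inside a pytest suite, `with pytest.raises(ValueError):` is the more usual way to express the expected failure; the try/except form is used here only to keep the sketch free of a pytest import.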


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: def9f1a9-762b-42f1-b770-485392227501

📥 Commits

Reviewing files that changed from the base of the PR and between 5dd96d6 and aa3a318.

📒 Files selected for processing (40)

.pre-commit-config.yaml
legacy-files.txt
pyproject.toml
ruff-legacy.toml
tensorrt_llm/_torch/autotuner.py
tensorrt_llm/_torch/custom_ops/__init__.py
tensorrt_llm/_torch/custom_ops/cute_dsl_megamoe_custom_op.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/__init__.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/blocked_scale.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/config.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/contract.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/custom_ext.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/dispatch_kernel.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/epilogue.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/fc1_fc2_fuse_sched.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/grid_sync.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/iket_compat.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/kernel_fc12.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/megamoe_constants.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/megamoe_kernel.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/moe_persistent_scheduler.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/moe_utils.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/ptx_helpers.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/sf_swizzle.py
tensorrt_llm/_torch/cute_dsl_kernels/mega_moe_nvfp4/sym_buffer.py
tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md
tensorrt_llm/_torch/modules/fused_moe/__init__.py
tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
tensorrt_llm/_torch/modules/fused_moe/create_moe.py
tensorrt_llm/_torch/modules/fused_moe/mega_moe/MEGAMOE_CUTEDSL_DESIGN.md
tensorrt_llm/_torch/modules/fused_moe/mega_moe/__init__.py
tensorrt_llm/_torch/modules/fused_moe/mega_moe/mega_moe_cute_dsl.py
tensorrt_llm/_torch/modules/fused_moe/mega_moe/mega_moe_deepgemm.py
tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
tests/microbenchmarks/bench_moe/backend.py
tests/microbenchmarks/bench_moe/utils.py
tests/unittest/_torch/modules/moe/moe_test_utils.py
tests/unittest/_torch/modules/moe/test_moe_backend.py
tests/unittest/_torch/modules/moe/test_moe_module.py

tensorrt-cicd · 2026-06-12T12:28:57Z

PR_Github #53881 [ run ] completed with state FAILURE. Commit: 426083d

Link to invocation

xxi-nv · 2026-06-13T03:25:31Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-13T03:32:14Z

PR_Github #54005 [ run ] triggered by Bot. Commit: c05c41f Link to invocation

tensorrt-cicd · 2026-06-13T10:16:07Z

PR_Github #54005 [ run ] completed with state SUCCESS. Commit: c05c41f
/LLM/main/L0_MergeRequest_PR pipeline #43090 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

xxi-nv · 2026-06-14T23:34:09Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-14T23:40:29Z

PR_Github #54161 [ run ] triggered by Bot. Commit: c05c41f Link to invocation

tensorrt-cicd · 2026-06-15T01:32:54Z

PR_Github #54161 [ run ] completed with state SUCCESS. Commit: c05c41f
/LLM/main/L0_MergeRequest_PR pipeline #43244 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

xxi-nv · 2026-06-15T03:12:00Z

CI failure triage (L0_MergeRequest_PR #43244, commit `c05c41f`)

The latest L0 pipeline reported 7 passed, 3 failed. After investigation, none of the 3 failures are caused by this PR's MegaMoE CuteDSL changes:

1 & 2. `test_moe_comm` / `test_moe_comm_boundary` (GB200) — pre-existing collection bug

These are pytest collection errors, not runtime failures:

ERROR collecting _torch/modules/moe/test_moe_comm.py
TestMoEComm::test_moe_comm: duplicate parametrization of 'mpi_pool_executor'
UNITTEST FAILED with exit code: 4  (1 error in 0.14s)

Root cause: with newer pytest, the shared mpi_pool_executor fixture (tests/unittest/conftest.py) is declared @pytest.fixture(params=[2, 4, 8]) while the tests also do @pytest.mark.parametrize("mpi_pool_executor,config", ..., indirect=["mpi_pool_executor"]) — the two together are rejected as a "duplicate parametrization" at collection time.

This is not introduced here: test_moe_comm.py, the mpi_pool_executor fixture, and l0_gb200_multi_gpus.yml are all identical between main and this branch — this PR does not touch any of them. It is a main-wide infra issue already being fixed by #15354 ([None][test] fix MPI pool fixture parametrization), whose own reported failures are non-MoE tests (pixtral / mm_encoder), confirming the bug is global.

3. `TestDeepSeekV3Lite::test_nvfp4_4gpus_online_eplb[fp8kv=True-moe_backend=WIDEEP]` (GB200)

"Test terminated unexpectedly" on the WIDEEP backend — not the MegaMoE CuteDSL path added by this PR; a separate infra/flaky timeout. The previous run's failures (NemotronV3Ultra HF _remote_code.lock contention, dwdp Address already in use) did not reproduce, consistent with flakiness.

Next step

Rebase once #15354 merges to clear the test_moe_comm collection error, and re-trigger CI for the WIDEEP eplb test. No changes to the MegaMoE CuteDSL kernels are required for these failures.

xxi-nv · 2026-06-15T23:27:19Z

/bot run --disable-fail-fast

xxi-nv · 2026-06-15T23:33:33Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-15T23:39:51Z

PR_Github #54376 [ run ] triggered by Bot. Commit: 0359542 Link to invocation

tensorrt-cicd · 2026-06-16T04:34:54Z

PR_Github #54376 [ run ] completed with state FAILURE. Commit: 0359542
/LLM/main/L0_MergeRequest_PR pipeline #43447 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

xxi-nv · 2026-06-16T04:57:05Z

CI failure triage (L0_MergeRequest_PR #43447, commit `0359542`)

The L0 pipeline reported 2 test failures. After investigation, neither failure is caused by this PR's MegaMoE CuteDSL changes — both are infrastructure / flaky Test terminated unexpectedly failures unrelated to this PR.

1. `test_auto_dtype[nvfp4]` (DGX_B200-PyTorch-2) — known global flaky

Failure reason: Test terminated unexpectedly (process crash/timeout, not an assertion/accuracy error).

This test is a known CI-wide flaky test: over the last 10 days it failed with the exact same Test terminated unexpectedly message in 33 distinct builds across 33 unrelated PRs (e.g. #14160, #15386, #14280, #15181, #13170, #15356, #15014, #14751, …), for 33 failures against 108 passes overall. This PR's build is simply another one that happened to hit it.

2. `test_bfloat16_4gpus[tp2pp2-mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False]` (GB200-4_GPUs-PyTorch-4) — single flaky

TestDeepSeekV3Lite bfloat16 GSM8K accuracy test (TP2/PP2 + MTP nextn=2). Failure reason: Test terminated unexpectedly. Over the last 10 days this case is 117 passed / 1 failed, the single failure being this very build — a one-off flaky on a resource-heavy GB200 4-GPU + MTP configuration. This is the bf16 path and does not exercise the NVFP4 MegaMoE CuteDSL code added here.
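
For scale, the pass/fail counts quoted for the two tests translate into the failure rates below. This is a small arithmetic sketch; the helper name `failure_rate` is illustrative and not part of any CI tooling.

```python
def failure_rate(failed: int, passed: int) -> float:
    """Fraction of runs that failed."""
    return failed / (failed + passed)


# Counts quoted in the triage above (last 10 days).
auto_dtype_rate = failure_rate(failed=33, passed=108)
bf16_4gpus_rate = failure_rate(failed=1, passed=117)

print(f"test_auto_dtype[nvfp4]: {auto_dtype_rate:.1%}")  # 23.4%
print(f"test_bfloat16_4gpus:    {bf16_4gpus_rate:.1%}")  # 0.8%
```

Roughly one in four runs of the first test fails regardless of the PR under test, while the second has a single failure in 118 runs.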

Why this is not caused by this PR

This PR changes 46 files, all confined to tensorrt_llm/_torch/cute_dsl_kernels (19), tensorrt_llm/_torch/modules (MoE, 9), custom_ops (2), autotuner.py, plus MoE unit tests, test-list/waives, and MoE microbenchmarks. It does not touch test_llm_api_pytorch.py, test_cli_flow.py, the DeepSeek/Llama PyTorch model implementations, attention, MTP, or KV-cache — i.e. none of the code paths these two failing tests exercise.

Next step

Both failures are infra/flaky and unrelated to this PR's MegaMoE CuteDSL changes. Re-triggering CI.

xxi-nv · 2026-06-16T04:57:16Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-16T05:02:41Z

PR_Github #54465 [ run ] triggered by Bot. Commit: 0359542 Link to invocation

tensorrt-cicd · 2026-06-16T15:26:35Z

PR_Github #54465 [ run ] completed with state SUCCESS. Commit: 0359542
/LLM/main/L0_MergeRequest_PR pipeline #43534 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

xxi-nv · 2026-06-17T00:08:53Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-17T00:15:22Z

PR_Github #54690 [ run ] triggered by Bot. Commit: 0359542 Link to invocation

tensorrt-cicd · 2026-06-17T03:39:37Z

PR_Github #54690 [ run ] completed with state FAILURE. Commit: 0359542
/LLM/main/L0_MergeRequest_PR pipeline #43722 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

mzweilz · 2026-06-17T05:49:02Z

/bot skip --comment "The previous --disable-fail-fast CI job with the same commit passed. And the only error in the latest result is a known issue unrelated to this PR."

tensorrt-cicd · 2026-06-17T05:55:09Z

PR_Github #54772 [ skip ] triggered by Bot. Commit: 0359542 Link to invocation

tensorrt-cicd · 2026-06-17T06:06:27Z

PR_Github #54772 [ skip ] completed with state SUCCESS. Commit: 0359542
Skipping testing for commit 0359542

Link to invocation

Signed-off-by: xxi <xxi@nvidia.com> Signed-off-by: GitLab CI Bot <gitlab-ci@nvidia.com>

xxi-nv requested review from a team as code owners May 27, 2026 03:08

xxi-nv requested review from hyukn, mlefeb01, tburt-nv and yuxianq May 27, 2026 03:08

github-actions Bot assigned xxi-nv May 27, 2026

xxi-nv requested review from Barry-Delaney, QiJune, ZhanruiSunCh, lfr-0531 and yuanjingx87 May 27, 2026 03:15

xxi-nv force-pushed the megamoe_cutedsl_nvfp4_v2 branch from aa3a318 to ff947a8 Compare May 27, 2026 03:17

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

xxi-nv changed the title ~~[None][feat] Add MegaMoECuteDsl NVFP4 MoE backend (preview, do not use)~~ [Draft][None][feat] Add MegaMoECuteDsl NVFP4 MoE backend May 27, 2026

xxi-nv force-pushed the megamoe_cutedsl_nvfp4_v2 branch from ff947a8 to 41b3aa8 Compare May 27, 2026 03:22

xxi-nv changed the title ~~[Draft][None][feat] Add MegaMoECuteDsl NVFP4 MoE backend~~ [Draft][TRTLLM-12950][feat] Add MegaMoECuteDsl NVFP4 MoE backend May 27, 2026

xxi-nv force-pushed the megamoe_cutedsl_nvfp4_v2 branch 2 times, most recently from 9134a9b to e736f64 Compare June 3, 2026 04:02

xxi-nv force-pushed the megamoe_cutedsl_nvfp4_v2 branch 6 times, most recently from d1e8b8b to d686fd4 Compare June 5, 2026 05:18

Merge branch 'main' into megamoe_cutedsl_nvfp4_v2

c05c41f

Merge branch 'main' into megamoe_cutedsl_nvfp4_v2

0359542

xxi-nv enabled auto-merge (squash) June 16, 2026 09:32

lfr-0531 approved these changes Jun 16, 2026

View reviewed changes

hyukn approved these changes Jun 17, 2026

View reviewed changes

xxi-nv merged commit 2772b99 into NVIDIA:main Jun 17, 2026
8 checks passed

xinhe-nv pushed a commit to tensorrt-cicd/TensorRT-LLM that referenced this pull request Jun 23, 2026

[TRTLLM-12950][feat] Add MegaMoECuteDsl NVFP4 MoE backend (NVIDIA#14608)

38ba775

Signed-off-by: xxi <xxi@nvidia.com> Signed-off-by: GitLab CI Bot <gitlab-ci@nvidia.com>

xinhe-nv pushed a commit to tensorrt-cicd/TensorRT-LLM that referenced this pull request Jun 24, 2026

[TRTLLM-12950][feat] Add MegaMoECuteDsl NVFP4 MoE backend (NVIDIA#14608)

4402b0a

Signed-off-by: xxi <xxi@nvidia.com> Signed-off-by: GitLab CI Bot <gitlab-ci@nvidia.com>
